1. Introduction

This document contains the details of my re-analysis of the sample mixups in the Attie eQTL data, to accompany the manuscript describing the work.

I will suppress much of the R code from the html version of this document; for details, see the asciidoc source file.

Here are the versions of R and the packages that I’m using:

> R.version$version.string
[1] "R version 3.2.2 (2015-08-14)"
> library(qtl)
> qtlversion()
[1] "1.36-6"
> library(lineup)
> lineupversion()
[1] "0.37-6"
> library(broman)
> bromanversion()
[1] "0.59-5"
> library(ascii)
> paste(unlist(packageVersion("ascii")), collapse=".") # ascii version
[1] "2.1"

2. Preliminary data cleaning

2.1. Genotype data

There were originally 554 mice. There were 3 mice with failed genotypes. We omitted the genotypes for another 7 mice, as the genotype data were bad. (They showed a high rate of apparent genotyping errors, an unusually large proportion of homozygous gneotypes, and many apparent crossovers.) That leaves 544 mice with genotype data.

The genotype data were cleaned to remove markers not segregating in the cross, to correctly align the BTBR and B6 alleles, and to remove some badly behaved markers (data not shown). Remaining were 2060 markers, including 20 on the X chromosome.

2.1.1. Sex swaps

The cross performed was (BTBR × B6) × (BTBR × B6), with females listed first. Thus, females should be homozygous BTBR or heterozygous while males are hemizygous B6 or BTBR.

We identified 33 mice whose X chromosome genotypes didn’t match their sex: There were 19 females whose genotype data indicated them to be males (many homogyzous B6 calls and no hets) and 14 males whose genotype data indicated them to be females (many heterozygous calls).

There are 112 mice with BTBR homozygous calls for the entire X chromosome, which is compatible with both sexes.

2.1.2. Sample duplicates

We identified 6 pairs of clear sample duplicates in the genotype data.

Mouse1

Mouse2

# matches

# typed

% mismatches

Mouse3259

Mouse3269

2017

2022

0.2

Mouse3267

Mouse3362

1933

1966

1.7

Mouse3287

Mouse3290

2012

2016

0.2

Mouse3317

Mouse3318

1964

1996

1.6

Mouse3353

Mouse3354

2026

2031

0.2

Mouse3553

Mouse3559

1998

2008

0.5

One pair, Mouse3267 and Mouse3362, were actually omitted from the genotype data as being badly behaved.

We omit one individual from each pair from the genotype data.

2.2. Gene expression data

We have gene expression microarray data on six tissues: adipose, gastroc, hypo, islet, kidney, and liver. These were two-color Agilent arrays, with the probes being 60-mers. For each array, one color was a tissue-specific pool and the other was an individual sample. The expression levels were quantified as the “mlratio”, which is approximately the log10 ratio of the individual to the pool, though with some corrections that I don’t fully understand, and with values always between -2 and +2.

There were approximately 500 expression arrays performed for each tissue. A number of outlying arrays were omitted from each group. For hypo, there was an additional batch of 119 poorly behaved arrays that were omitted in later analyses, but they are included here as they contain a number of detectable sample mixups.

# arrays

# omitted

# kept

adipose

497

4

493

gastroc

498

2

496

hypo

494

1

493

islet

499

1

498

kidney

482

1

481

liver

491

1

490

3. Lining up expression arrays

In this section, we line up the expression arrays.

The first step is to identify probes that are correlated between tissues. (For each array, there is a total of 40572 probes.) For each pair of arrays, we find the samples that are in common and calculate the correlation between tissues for each probe.

The following figure contains density estimates of the correlation distributions for the 15 pairs of tissues.

Figs/fig-corr_dist.jpg

Here are counts, for each tissue pair, of probes with large between-tissue correlation.

Tissue1

Tissue2

corr > 0.7

corr > 0.75

corr > 0.8

corr > 0.9

adipose

gastroc

199

143

99

30

adipose

hypo

110

72

50

7

adipose

islet

216

159

106

38

adipose

kidney

255

186

135

51

adipose

liver

159

113

79

19

gastroc

hypo

79

55

43

10

gastroc

islet

180

132

92

33

gastroc

kidney

219

164

109

43

gastroc

liver

149

102

71

23

hypo

islet

127

82

57

10

hypo

kidney

131

92

60

17

hypo

liver

63

46

33

6

islet

kidney

269

200

146

42

islet

liver

152

97

64

24

kidney

liver

245

155

106

30

For each pair of tissues, we select the probes with between-tissue correlation > 0.75 and then calculate the between-individual correlations (the correlation between an individual in one tissue and an individual in the second tissue), using that subset of probes.

The selection of probes with high cross-tissue correlation is intended to pull out probes with strong signal. The between-individual correlations indicates the appropriate alignment of samples.

Then, for each tissue, we summarize the five correlation matrices, comparing that tissue to each other tissue, by taking, for each pair of individuals, the median of the five tissue-tissue correlations. This relies on the assumption that different samples are mixed up in the different tissues.

3.1. Adipose

There were 493 individuals assayed for adipose expression, after omitting 4 bad arrays. The analysis described above produces a 493 × 527 of median correlations, with the rows corresponding to the adipose samples.

The following figure contains histograms of the self-self correlations and self-non self correlations (or really medians of correlations). The self-self correlations are mostly large, but there are 5 samples with median self-self correlations < 0.5. The two modes in the self-nonself correlations correspond to mixed-sex and same-sex pairs.

Figs/fig-adipose_expr_hist_fig.jpg

For each row, we pull out the maximum correlation (or really the maximum of the median correlations); in the following, we plot the self-self correlations against these maxima. For most samples, the self-self correlation is the maximum; these samples appear to be aligned correctly. The 5 samples with problems are highlighted in green, and we indicate the sample label and then what we infer to be the correct label.

Figs/fig-adipose_expr_selfmax_fig.jpg

For the 5 problem samples, the self-self correlation is low but in each case there is another sample for which correlation is high. We infer that Mouse3583 and Mouse3584 were swapped, and then there is a rotated trio, Mouse3187 → Mouse3188 → Mouse3200 → Mouse3187.

Scatterplots

Let’s look at detailed scatterplots of the expression data for the sample swaps. We pull out the probes that show correlation > 0.6 between adipose and at least one other tissue, and then make a scatter plot of a sample from one tissue against a sample from the second tissue. Points correspond to probes, with the sizes of the points indicating the probes' correlations, across individuals, for that pair of tissues.

First, Mouse3583 and Mouse3584. You can see that, in all tissues, Mouse3583 is not correlated with itself but is correlated with Mouse3584, and vice versa.

Figs/fig-adipose_3583_v_3584.jpg

Now the rotated trio, Mouse3187 → Mouse3188 → Mouse3200 → Mouse3187. Note that Mouse3188 was swapped with Mouse3179 in hypo (see below). Mouse3187 in hypo seems correct but poorly behaved. Mouse3187 and Mouse3188 were not assayed for kidney expression.

Figs/fig-adipose_3187_3188_3200.jpg

3.2. Gastroc

There were 496 individuals assayed for gastroc expression, after omitting 2 bad arrays.

We jump to the plot of the self-self correlations versus the maximum correlation. Again, for most samples, the self-self correlation is the maximum; these samples appear to be aligned correctly. The 2 samples with problems are highlighted in green, and we indicate the sample label and then what we infer to be the correct label.

Figs/fig-gastroc_expr_selfmax_fig.jpg

There is an additional potential problem highlighted in orange, having low self-self correlation, and being slightly more correlated to another sample than to itself. But this is Mouse3188, which was involved in sample mix-ups in both adipose and hypo. We believe it’s correctly labeled in gastroc.

The 2 problem samples are inferred to be a sample swap, Mouse3655 ↔ Mouse3659.

Scatterplots

Let’s look at detailed scatterplots of the expression data for the sample swap, Mouse3655 and Mouse3659. You can see that, in all tissues, Mouse3655 is not correlated with itself but is correlated with Mouse3659, and vice versa.

Figs/fig-gastroc_3655_v_3659.jpg

3.3. Hypo

There were 493 individuals assayed for hypo expression, after omitting 1 bad array.

Here is the plot of the self-self correlations versus the maximum correlation. Again, for most samples, the self-self correlation is the maximum; these samples appear to be aligned correctly. The 17 samples with problems are highlighted in green, but there are too many to attach labels.

Figs/fig-hypo_expr_selfmax_fig.jpg

Here are the summary statistics for the problems.

maximum corr

self corr

Inferred label

Mouse3592

0.94

0.57

Mouse3594

Mouse3594

0.94

0.57

Mouse3592

Mouse3369

0.94

0.28

Mouse3367

Mouse3367

0.86

0.41

Mouse3369

Mouse3590

0.94

0.72

Mouse3589

Mouse3589

0.93

0.73

Mouse3590

Mouse3454

0.93

0.46

Mouse3452

Mouse3452

0.89

0.42

Mouse3454

Mouse3451

0.93

-0.22

Mouse3449

Mouse3449

0.91

-0.22

Mouse3451

Mouse3382

0.85

0.54

Mouse3381

Mouse3347

0.84

0.16

Mouse3348

Mouse3348

0.75

0.09

Mouse3347

Mouse3179

0.75

0.15

Mouse3188

Mouse3188

0.74

0.27

Mouse3179

Mouse3210

0.72

0.25

Mouse3208

Mouse3208

0.70

0.23

Mouse3210

Mouse3589 and Mouse3590 have reasonably high self-self correlation (0.7) but are each much more highly correlated to the other (0.9).

We infer a set of nine sample swaps:

Mouse3179 ↔ Mouse3188

Mouse3208 ↔ Mouse3210

Mouse3347 ↔ Mouse3348

Mouse3367 ↔ Mouse3369

Mouse3381 ↔ Mouse3382

Mouse3449 ↔ Mouse3451

Mouse3452 ↔ Mouse3454

Mouse3589 ↔ Mouse3590

Mouse3592 ↔ Mouse3594

There is one wrinkle here: Mouse3381 was not assayed for hypo expression, but really it was Mouse3382 that was not assayed; Mouse3381 was assayed but was labeled Mouse3382.

Scatterplots

Let’s look at detailed scatterplots of the expression data for all of these sample swaps.

First, Mouse3179 ↔ Mouse3188. Recall that Mouse3188 was part of a rotated trio in adipose (Mouse3187 → Mouse3188 → Mouse3200 → Mouse3187). These are not too pretty, but it is a clear swap Mouse3179 ↔ Mouse3188 in hypo.

Figs/fig-hypo_3179_3188.jpg

Mouse3208 ↔ Mouse3210: These are rather messy, but it is still a clear sample swap.

Figs/fig-hypo_3208_3210.jpg

Mouse3347 ↔ Mouse3348: Again, these are rather messy, but still a clear sample swap.

Figs/fig-hypo_3347_3348.jpg

Mouse3367 ↔ Mouse3369

Figs/fig-hypo_3367_3369.jpg

Mouse3381 ↔ Mouse3382 (but note that Mouse3381 was not intended to be done in hypo).

Figs/fig-hypo_3381_3382.jpg

Mouse3449 ↔ Mouse3451

Figs/fig-hypo_3449_3451.jpg

Mouse3452 ↔ Mouse3454

Figs/fig-hypo_3452_3454.jpg

Mouse3589 ↔ Mouse3590

Figs/fig-hypo_3589_3590.jpg

Mouse3592 ↔ Mouse3594

Figs/fig-hypo_3592_3594.jpg

3.4. Islet

There were 498 individuals assayed for islet expression, after omitting 1 bad array.

Here is the plot of the self-self correlations versus the maximum correlation. Again, for most samples, the self-self correlation is the maximum; these samples appear to be aligned correctly. The 3 samples with problems are highlighted in green, and we indicate the sample label and then what we infer to be the correct label.

Figs/fig-islet_expr_selfmax_fig.jpg

There is an additional potential problem highlighted in orange, having low self-self correlation, and being slightly more correlated to another sample than to itself, but this is Mouse3188 again, which was involved in sample mix-ups in both adipose and hypo. We think it is correctly labeled in islet.

The 3 problem samples are inferred to be a sample swap, Mouse3598 ↔ Mouse3599, plus a sample duplicate: Mouse3295 was run correctly but also with the label Mouse3296 (more on this below).

Scatterplots

Let’s look at detailed scatterplots of the expression data for the sample mixups.

First, the sample swap, Mouse3598 ↔ Mouse3599. You can see that, in all tissues, Mouse3598 is not correlated with itself but is correlated with Mouse3599, and vice versa.

Figs/fig-islet_3598_v_3599.jpg

Now we turn to the apparent duplicate, the Mouse3296 sample really being Mouse3295. Mouse3296 in islet is not correlated with itself but is correlated with Mouse3295. Mouse3295 in islet looks fine: It’s correlated with itself and not correlated Mouse3296 in the other tissues. We’ll revisit this below in the section on Sample duplicates.

Figs/fig-islet_3295_v_3296.jpg

3.5. Kidney

There were 481 individuals assayed for kidney expression, after omitting 1 bad array. Note that there were 27 samples that were assayed for kidney expression but no other tissue.

Here is the plot of the self-self correlations versus the maximum correlation. Again, for most samples, the self-self correlation is the maximum; these samples appear to be aligned correctly. The 2 samples with problems are highlighted in green, and we indicate the sample label and then what we infer to be the correct label.

Figs/fig-kidney_expr_selfmax_fig.jpg

There is an additional potential problem highlighted in orange (Mouse3484), being a bit more correlated to another sample (Mouse3503) than to itself. These two arrays are oddly behaved and will be investigated further below.

The 2 problem samples are inferred to be a sample swap, Mouse3510 ↔ Mouse3523.

Scatterplots

Let’s look at detailed scatterplots of the expression data for the sample mixups.

First, the sample swap, Mouse3510 ↔ Mouse3523. You can see that, in all tissues, Mouse3510 is not correlated with itself but is correlated with Mouse3523, and vice versa.

Figs/fig-kidney_3510_v_3523.jpg

Now we show scatterplots for the odd pair, Mouse3484 and 3503. They all seem correlated with each other. We’ll revisit this below in the section on Sample duplicates.

Figs/fig-kidney_3484_v_3503.jpg

3.6. Liver

There were 490 individuals assayed for liver expression, after omitting 1 bad array.

Here is the plot of the self-self correlations versus the maximum correlation. The 2 samples with problems are highlighted in green, and we indicate the sample label and then what we infer to be the correct label.

Figs/fig-liver_expr_selfmax_fig.jpg

The 2 problem samples are inferred to be a sample swap, Mouse3142 ↔ Mouse3143 (though Mouse3143 was not intended to be assayed for liver), and a sample duplicate: Mouse3136 was assayed correctly and also with the label Mouse3141 (more below).

Scatterplots

Let’s look at detailed scatterplots of the expression data for the sample mixups.

First, the sample swap, Mouse3142 ↔ Mouse3143, but with Mouse3143 not intended to be assayed in liver. You can see that, in all tissues, Mouse3142 is not correlated with itself but is correlated with Mouse3143.

Figs/fig-liver_3142_v_3143.jpg

Now we turn to the apparent duplicate, the Mouse3141 sample really being Mouse3136. We’ll revisit this in the next section, on Sample duplicates.

Figs/fig-liver_3141_v_3136.jpg

3.7. Sample duplicates

Above, we had identified several apparent sample duplicates (which might be called “unintended technical replicates”). We now look at these in more detail.

Consideration of all probes on an array will not be informative, as most genes are not expressed in a tissue and so the measurements are just noise. To identify probes that with signal, we pull out those probes with correlation > 0.6 against at least one other tissue.

For each tissue, there are non-duplicate pairs with correlations that reach 0.9, while each of islet, kidney and liver have a pair of samples with correlation > 0.98.

Here are the top 5 between-tissue correlations for each tissue.

adipose

gastroc

hypo

islet

kidney

liver

0.86

0.90

0.89

0.98

0.99

0.98

0.85

0.88

0.88

0.86

0.87

0.89

0.85

0.88

0.87

0.86

0.87

0.87

0.85

0.88

0.87

0.86

0.86

0.87

0.84

0.87

0.87

0.84

0.86

0.87

In islet, Mouse3296 is a duplicate of Mouse3295.

Figs/fig-islet_dup_3295_v_3296.jpg

In liver, Mouse3141 is a duplicate of Mouse3136.

Figs/fig-liver_dup_3141_v_3136.jpg

In kidney, Mouse3484 and Mouse3503 remain a bit of a conundrum, since they each are correlated with the other in all tissues.

Figs/fig-kidney_dup_3484_v_3503.jpg

This feature is specific to the two kidney samples. For example, consider the between-individual correlations for all 15 tissue pairs. Here, we’ll grab the subset of 76 probes with correlation > 0.6 for all tissue pairs.

Here are plots of the 15 between-tissue correlations for these two mice. The pairs involving kidney are in red.

Figs/fig-between_tissue_corr_3484_3503_fig.jpg

I still don’t know what to make of this. It’s interesting to note that the red points are at about the same height in all four figures, while the blue points are high for 3484:3484 and 3503:3503 but low for 3484:3503 and 3503:3484.

Why should this pair of kidney samples have such similar expression? I’m inclined to think that a mixture of the two RNAs were placed on both arrays.

I will omit these two arrays from further analyses.

3.8. Summary

Here’s a summary of the arrays omitted and found do be in mislabeled, for each tissue.

total

omit

okay

error

adipose

497

4

488

5

gastroc

498

2

494

2

hypo

494

1

476

17

islet

499

1

495

3

kidney

482

3

477

2

liver

491

1

488

2

I now omit the relevant arrays, combine the duplicates in liver and islet (by simply taking the average of the mlratios for each probe) and then re-name the arrays as inferred.

4. Line up expression with genotypes

We now turn to the alignment of the expression arrays with the genotype data.

We first pull out all probes that have an annotated genomic position to an autosome. There are 36364 such probes.

We next identify the nearest marker or pseudomarker for each probe. QTL genotype probabilities were calculated at the markers and at 0.25 cM steps along the genome.

We then calculate a local LOD score for each probe in each tissue, at the marker/pseudomarker closest to the probe.

Here is a table of the numbers of probes in each tissue with local LOD > 100.

adipose

gastroc

hypo

islet

kidney

liver

103

88

48

135

88

54

We grab the probes with local LOD score > 100, construct a k-nearest neighbor classifier, for classifying eQTL genotype from expression, for each eQTL, and finally calculate the proportion mismatches between predicted and observed eQTL genotypes. (We use k=20 and a minimum class probability of 0.8 to make a decision.)

Our measure of distance between samples is the proportion of mismatches between the observed and predicted eQTL genotypes. Consider the distance matrix with DNA samples as rows and mRNA samples as columns. It’s particularly informative to make a scatterplot of the minimum value in a row against the self-self distance. These plots are analogous to the plots of maximum correlation vs. self-self correlation that we had considered in aligning the expression arrays.

In the following, we display such a plot for each tissue considered separately, plus with the distances matrices combined (by simply taking the overall proportion of mismatches, across all tissues).

Figs/fig-dgve_figure.jpg

In each case, the majority of samples are correctly labeled; for these, with the results for all tissues combined, the self-self distance is the minimal distance (the points in blue). There are 435 such samples.

The green points correspond to samples that were incorrectly labeled, but that are fixable: the self-self distance is large, indicating a clear error, but the minimal distance is small; there is an mRNA sample to which the DNA sample appears to correpond. With the results combined across tissues, there are 72 such samples.

The red points correspond to samples that were incorrectly labeled and not fixable: the self-self distance is large, indicating a clear error, and the minimal distance is also large; there is no mRNA sample to which the DNA sample appears to correspond. With the results combined across tissues, there are 12 such samples.

There are an additional 20 DNA samples for which expression assays were not intended to be done. These are not represented in the above figure, as their self-self distances are missing. However, we can calculate the minimum distance between these DNA samples and all mRNA samples. There are 12 such DNA samples that have < 10% mismatches with one of the mRNA sample; these likely are fixable sample mixups. The other 8 DNA samples have no mRNA sample that seems to correspond and so may be correct (though we cannot be certain).

4.1. Potential discrepancies

Note that the points are colored based on the results for the six tissues combined. For each tissue there are numerous green points (with the combined data, viewed as “fixable”) that have high minimum mismatches. These are cases where the inferred sample was not run in that particular tissue but was run in others: For the most part, these are samples that were run only in kidney or run in other tissues but not kidney. Hypo is a bit messy, though, with less-clear separation between “fixable” and “not fixable”.

Here is a table of these discrepancies. The numbers are the minimal percent mismatches between observed and inferred eQTL genotypes, by tissue and then with all tissues combined. The “Note” column explains why some tissues show a discrepancy; most are due to missing expression data.

Sample

Inferred

Note

adipose

gastroc

hypo

islet

kidney

liver

combined

3329

3328

all but adipose

38

1

3

0

0

0

1

3382

3381

all but kidney

1

2

16

1

38

5

3

3369

3367

all but kidney

0

5

0

1

33

0

1

3367

3365

all but kidney

1

4

3

3

35

2

3

3334

3333

all but kidney

0

1

0

0

35

0

0

3385

3384

all but kidney

4

1

12

2

42

2

3

3363

3361

all but kidney

1

4

3

1

36

0

2

3356

3354

all but kidney

2

4

3

2

30

0

2

3368

3366

all but kidney

7

0

9

1

34

0

3

3352

3351

all but kidney

2

1

6

3

25

0

2

3355

3353

all but kidney

5

6

0

5

42

3

4

3567

3559

all but kidney

0

3

0

1

31

2

1

3406

3405

all but liver

0

1

0

1

0

32

1

3320

3319

many sloppy

11

3

10

10

16

6

10

3326

3325

missing gastroc and kidney

4

36

14

3

31

0

4

3344

3343

missing hypo and kidney

5

0

31

1

37

4

2

3395

3394

only kidney

35

36

35

34

4

35

4

3397

3396

only kidney

32

39

24

32

4

27

4

3339

3338

only kidney

37

37

29

38

0

38

0

3408

3407

only kidney

28

24

26

30

1

32

1

3407

3406

only kidney

45

40

35

47

1

41

1

3399

3398

only kidney

42

35

35

36

0

33

0

3400

3399

only kidney

31

36

30

32

1

35

1

3393

3390

only kidney

34

36

27

35

6

29

6

3396

3395

only kidney

39

41

33

42

3

41

3

3398

3397

only kidney

32

33

32

40

1

28

1

3378

3377

only kidney

35

33

20

36

0

33

0

3390

3389

only kidney

33

24

38

27

0

27

0

3401

3400

only kidney

40

43

37

41

1

31

1

3405

3404

only kidney

31

24

23

26

1

24

1

3359

3357

sloppy hypo

3

8

33

0

0

5

5

3366

3364

sloppy hypo

1

1

29

2

0

0

3

3374

3373

sloppy hypo

4

4

18

2

3

0

4

3351

3350

sloppy hypo

1

3

17

1

1

5

3

3381

3378

sloppy hypo

4

1

32

3

1

8

6

3370

3368

sloppy hypo

2

4

16

2

7

7

5

3365

3363

sloppy hypo

3

2

18

2

0

2

3

3260

3286

sloppy liver

4

3

3

7

3

21

6

There are two additional samples that deserve to be highlighted, which are two blue points (inferred to be correctly labeled) that are off the diagonal line in one tissue:

  • Mouse3270 shows 39% mismatches (18/46) in liver, and 8% mismatches (34/403) overall.

  • Mouse3142 shows 26% mismatches (9/34) in hypo, and 4% mismatches (14/372) overall.

In both of these cases, I’m inclined to conclude that the expression array data may be of low quality, but the samples are correctly labeled.

4.2. Fix up sample duplicates

The re-alignment of the DNA samples requires that we go back to the inferred sample duplicates and relabel some of them. In particular, the sample for Mouse3269 was found to be a duplicate of the sample for Mouse3259, but the sample for Mouse3259 was actually for Mouse3332, and so both are really Mouse3332. Similarly, the sample for Mouse3354 was a duplicate of the sample for Mouse3353, but the sample for Mouse3353 was actually Mouse3352, and so both are really Mouse3352.

4.3. Summary

Here’s a summary of the problems observed in the genotype data.

okay

435

omitted

10

duplicate

5

maybe okay

8

error

96

total

554

5. X chromosome and sex

5.1. Genotypes

I originally came to these observations by identifying a set of samples for which sex did not match the X chromosome genotype. It would be good to revisit that issue: with the genotypes realigned, do the sexes now match what was observed in the X chromosome data?

We first grab the set of DNAs for which we are now sure of the appropriate labels. (There are 519 such.)

We want to compare the inferred sex from the X chromosome genotypes for the samples to the documented sex (attached to the correct label).

Of the 519 DNA samples that are correct or can be corrected, 23 showed a mismatch between sex and X chromosome genotypes. With the corrected sample labels, there are 3 discrepancies:

Original ID

Corrected ID

Sex from genotypes

Sex from necropsy

Mouse3368

Mouse3366

Male

Female

Mouse3393

Mouse3390

Male

Female

Mouse3399

Mouse3398

Female

Male

But all three of these mice are homozygous (or hemizygous) BTBR across the entire X chromosome, and so the X chromosome genotypes are compatible with either sex. Thus, there are no real discrepancies between the X chromosome genotypes and the sex of the mice, once the labels for the DNA samples have been corrected.

RR

BR

BY

RY

Mouse3368

0

0

0

20

Mouse3393

0

0

0

20

Mouse3399

19

0

0

0

5.2. Gene expression

We can also look at the expression of the Xist gene (which should be high in females and 0 in males) and of Y chromosome genes (which should be >0 in males and 0 in females).

In the following, I plot the average expression of 3 Y chromosome genes against the expression of Xist, with points colored by the sex of the sample (blue is male; red is female). (The expression arrays contained 21 probes for genes on the Y chromosome, but only a portion of them showed clear sex differences in expression in all tissues, and so I focused on these.)

Figs/fig-XandYexprfig.jpg

Hypo is a bit of a mess, with a bunch of females having unusually low Xist expression and a bunch of males having unusually low Y chromosome expression; this is almost entirely explained by a bad batch of arrays that we ultimately removed.

The other five tissues look great. There is one case of a male with relatively high Xist expression in adipose, but the sex assignments of the expression arrays, after the fixing labels, appears to be correct.

The following is the plot for hypo, without the bad batch of arrays:

Figs/fig-hypo_XandYexprfig.jpg

6. QTL mapping results, before and after

I now will compare the QTL mapping results, before and after the correction of the sample mixups. I will look at insulin (a particularly important trait), agouti coat color and tufted coat (both simple Mendelian phenotypes), and eQTL for all tissues.

I first need to re-align the genotype data. I will drop all samples whose labels are not clear (that includes any DNA without associated expression data).

One problem with the data in the original form are the 28 mice whose X chromosome genotypes don’t match their sex. For these, I will use the correct sex and omit the X chromosome genotype data.

6.1. Insulin

First, I’ll look at insulin levels at 10 weeks; we actually have three different measures of insulin at 10 weeks, and we also have insulin at 4, 6 and 8 weeks, but I’ll just look at one of the 10 week measures. I’ll use the log scale and will include sex as an interactive covariate (that is, allowing QTL to have different effect in the two sexes).

In the original data, there were 539 mice; in the corrected data, we are left with 519 mice. The following are the LOD curves, with those for the original data in red and those for the corrected data in blue.

Figs/fig-insulin_lod_fig.jpg

With the original data, there were 3 chromosomes having a LOD score ≥ 5; with the corrected data, there are 7 chromosomes with LOD ≥ 5.

Here is a summary table of the peak LOD on chromosomes with inferred QTL.

Chr

LOD (old)

LOD (new)

Pos’n (old)

Pos’n (new)

Increase in LOD

2

7.05

9.21

70.6

71.0

2.16

5

3.00

5.08

56.2

54.7

2.09

6

3.56

5.30

43.6

43.1

1.74

7

4.73

6.31

87.4

86.0

1.58

9

3.31

6.72

49.0

46.2

3.41

12

5.58

5.83

10.5

10.0

0.25

19

6.59

6.79

43.8

44.3

0.20

The LOD scores on chromosomes 2, 5, 6, 7, and 9 went up substantially. The peak locations are largely unchanged.

6.2. Agouti coat

For most mice, it was recorded whether they had an agouti coat or not. This is due to a single gene (a, non-agouti) on chromosome 2 at 154.6-154.8 Mbp.

In both the original and corrected data, agouti coat color maps strongly to chromosome 2. The LOD score increased from 64 to 110 with the correction of the sample mixups. In both cases, the peak is at 154.4 Mbp, right next to the true gene.

Figs/fig-agouti_lod_fig.jpg

If we grab the mice with good genotype data at the agouti locus, we obtain the following relationship between agouti genotype and phenotype for the original data (with alleles B = B6 and R = BTBR):

Agouti

Tan

Black

Chr 2 genotype

BB

26

114

BR

249

15

RR

88

6

With the corrected data, we get the following table.

Agouti

Tan

Black

Chr 2 genotype

BB

5

126

BR

255

2

RR

92

0

With the original data, there are 47 mismatches out of 498 mice. With the corrected data, there are 7 mismatches out of 480 mice.

So there are still 7 mismatches remaining, but the results are considerably improved.

Here are the actual mice with mismatching agouti genotype/phenotype:

pheno

geno

Mouse3159

Tan

BB

Mouse3179

Black

BR

Mouse3180

Tan

BB

Mouse3225

Tan

BB

Mouse3394

Tan

BB

Mouse3507

Tan

BB

Mouse3586

Black

BR

The genotypes for the mismatching individuals seem clear and are not likely genotyping errors. Here are the genotypes on chromosome 2 for these seven mice, with the agouti locus indicated by a green vertical line. The top five have tan coats and should be BR (gray points) or RR (black points) at the agouti locus, but are all BB (white points). The bottom two have black coats and should be BB (white points) at the agouti locus but are BR (gray points).

Figs/fig-agouti_mismatch_geno.jpg

6.3. Tufted coat

For most mice, we also have information on whether they had a tufted coat or not. This is due to a single gene on chromosome 17, though it seems that the actual gene is not yet known.

In both the original and corrected data, tufted coat maps strongly to chromosome 17, though the LOD score increased from 64 to 107 with the correction of the sample mixups. In both cases, the peak is at marker rs3700924, at 27.2 Mbp.

Figs/fig-tufted_lod_fig.jpg

If we grab the mice with good genotype data at the tufted locus, we obtain the following relationship between tufted genotype and phenotype for the original data (with alleles B = B6 and R = BTBR):

Tufted coat

No

Yes

Chr 17 genotype

BB

151

7

BR

258

9

RR

21

92

With the corrected data, we get the following table.

Tufted coat

No

Yes

Chr 17 genotype

BB

153

0

BR

256

0

RR

4

106

With the original data, there are 37 mismatches out of 538 mice. With the corrected data, there are 4 mismatches out of 519 mice.

So there are still 4 mismatches remaining, but, as with agouti coat, the results are considerably improved after the correction for the sample mixups.

Here are the actual mice with mismatching tufted genotype/phenotype:

pheno

geno

Mouse3152

Not tufted

RR

Mouse3154

Not tufted

RR

Mouse3275

Not tufted

RR

Mouse3603

Not tufted

RR

The genotypes for the mismatching individuals seem clear and are not likely genotyping errors. Here are the genotypes on chromosome 17 for these four mice, with the tufted locus indicated by a green vertical line. These all have non-tufted coats and should be BB (white gray) or BR (gray points) but are all RR (black points) at the tufted locus.

Figs/fig-tufted_mismatch_geno.jpg

6.4. eQTL analysis

Finally, we will map the expression of genes in all six tissues, with both the original data and the corrected data. We will focus on probes for which we were able to infer a genomic location, and on an autosome or the X chromosome (i.e., not the Y or mitochondria). There are 37797 such probes.

For all probes, we transform the expression measures to normal quantiles, and we include sex as an interactive covariate in the QTL analyses.

For the hypo expression data, we omitted a batch of 119 arrays that were found to be bad. For each transcript in each tissue, we used Haley-Knott regression to calculate LOD curves across the chromosome. For each transcript, we find the maximum LOD score on each chromosome and count the number of chromosomes that have LOD ≥ 5. A transcript is said to have a local-eQTL if the LOD score on its chromosome is ≥ 5 and the genomic location of the transcript is within the 2-LOD support interval for that eQTL; otherwise, a peak above 5 on that chromosome is considered a trans-eQTL.

Here is a figure with the numbers of local- and trans-eQTL for each tissue, with the original data and with the corrected data.

Figs/fig-neqtl_table.jpg

Across all tissues, there were 16946 local-eQTL and 69332 trans-eQTL with the original data. After correcting the sample mixups, these numbers increased to 18183 and 95152 for local-eQTL and trans-eQTL, respectively. These are increases of 7% and 37%, respectively.

6.5. eQTL analysis, LOD > 10

In response to a question from a reviewer, let’s repeat this, looking for eQTL with LOD ≥ 10.

Figs/fig-neqtl_table_10.jpg

Across all tissues, there were 9527 local-eQTL and 12262 trans-eQTL with the original data. After correcting the sample mixups, these numbers increased to 10339 and 17470 for local-eQTL and trans-eQTL, respectively. These are increases of 9% and 42%, respectively.

7. Summary

We have identified strong support for a large number of DNA sample mixups in this project; we further identified a much smaller number of sample mixups within each set of RNA samples. Fortunately, we are able to correct the majority of these problems and recover from the mistakes.

As should be expected, correcting these sample mixups leads to great improvement in the QTL mapping results, both for clinical phenotypes, such as insulin level, and in the expression traits.